Back in 2014, fivethiryeight.com published an article on alchohol consumption in different countries. The data drinks is available as part of the fivethirtyeight package. Make sure you have installed the fivethirtyeight package before proceeding.
library(fivethirtyeight)
data(drinks)After skimming the data, wer can see that there are no missing values in the dataset. The variable types are:
| Column name | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|---|
| Variable type | character | integer | integer | integer | double |
glimpse(drinks)## Rows: 193
## Columns: 5
## $ country <chr> "Afghanistan", "Albania", "Algeria", "And…
## $ beer_servings <int> 0, 89, 25, 245, 217, 102, 193, 21, 261, 2…
## $ spirit_servings <int> 0, 132, 0, 138, 57, 128, 25, 179, 72, 75,…
## $ wine_servings <int> 0, 54, 14, 312, 45, 45, 221, 11, 212, 191…
## $ total_litres_of_pure_alcohol <dbl> 0.0, 4.9, 0.7, 12.4, 5.9, 4.9, 8.3, 3.8, …
skim(drinks)| Name | drinks |
| Number of rows | 193 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 28 | 0 | 193 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| beer_servings | 0 | 1 | 106.16 | 101.14 | 0 | 20.0 | 76.0 | 188.0 | 376.0 | ▇▃▂▂▁ |
| spirit_servings | 0 | 1 | 80.99 | 88.28 | 0 | 4.0 | 56.0 | 128.0 | 438.0 | ▇▃▂▁▁ |
| wine_servings | 0 | 1 | 49.45 | 79.70 | 0 | 1.0 | 8.0 | 59.0 | 370.0 | ▇▁▁▁▁ |
| total_litres_of_pure_alcohol | 0 | 1 | 4.72 | 3.77 | 0 | 1.3 | 4.2 | 7.2 | 14.4 | ▇▃▅▃▁ |
We will now make a plot that shows the top 25 beer consuming countries-
drinks %>%
slice_max ( order_by = beer_servings, n=25 ) %>% # taking top 25 countries by servings
#plotting a graph
ggplot(aes(x = beer_servings, y = fct_reorder(country, beer_servings))) +
geom_col(fill="orange") +
#labelling the graph and the axes
labs(
title = "Top 25 Beer Consuming Countries in 2010",
subtitle = "Standard Servings Per Person",
x = "Beer Servings (in cans)",
y = "Country"
) We will now make a plot that shows the top 25 wine consuming countries-
drinks %>%
slice_max ( order_by = wine_servings, n=25 ) %>% # taking top 25 countries by servings
ggplot(aes(x = wine_servings, y = fct_reorder(country, wine_servings))) +
geom_col(fill="dark red") +
labs(
title = "Top 25 Wine Consuming Countries in 2010",
subtitle = "Standard Servings Per Person",
x = "Wine Servings (in glasses)",
y = "Country"
)Finally, make a plot that shows the top 25 spirit consuming countries
drinks %>%
slice_max ( order_by = spirit_servings, n=25 ) %>% # taking top 25 countries by servings
ggplot(aes(x = spirit_servings, y = fct_reorder(country, spirit_servings))) +
geom_col(fill="grey") +
labs(
title = "Top 25 Spirit Consuming Countries in 2010",
subtitle = "Servings (in shots) Per Person",
x = "Spirit Servings",
y = "Country"
) > Inferences & Key takeaways-
European countries are high consumers of wine.
Beer is more evenly distributed around the world in the top 25 countries, as compared to wine and spirit.
European countries are higher ranked for overall consumption of beer, wine and spirit.
We see the countries with higher population are lower on the graphs, since the data is plotted per population. Hence the distribution of ages in the population within each country will also affect the alcohol, wine and spirits consumption.
We will look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset
Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:
gross : The gross earnings in the US box office, not adjusted for inflationbudget: The movie’s budgetcast_facebook_likes: the number of facebook likes cast members receivedvotes: the number of people who voted for (or rated) the movie in IMDBreviews: the number of reviews for that movierating: IMDB average ratingFrom the dataset, we can see that: 1. There are no missing values in the dataset. 2. The movies with duplicated entries. There are Duplicates (2907 distinct titles in 2961 rows) The following piece of code shows how we have cleaned it.
movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
# no missing values. There are Duplicates (2907 distinct titles in 2961 rows).
skim(movies)| Name | movies |
| Number of rows | 2961 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| title | 0 | 1 | 1 | 83 | 0 | 2907 | 0 |
| genre | 0 | 1 | 5 | 11 | 0 | 17 | 0 |
| director | 0 | 1 | 3 | 32 | 0 | 1366 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2.00e+03 | 9.95e+00 | 1920.0 | 2.00e+03 | 2.00e+03 | 2.01e+03 | 2.02e+03 | ▁▁▁▂▇ |
| duration | 0 | 1 | 1.10e+02 | 2.22e+01 | 37.0 | 9.50e+01 | 1.06e+02 | 1.19e+02 | 3.30e+02 | ▃▇▁▁▁ |
| gross | 0 | 1 | 5.81e+07 | 7.25e+07 | 703.0 | 1.23e+07 | 3.47e+07 | 7.56e+07 | 7.61e+08 | ▇▁▁▁▁ |
| budget | 0 | 1 | 4.06e+07 | 4.37e+07 | 218.0 | 1.10e+07 | 2.60e+07 | 5.50e+07 | 3.00e+08 | ▇▂▁▁▁ |
| cast_facebook_likes | 0 | 1 | 1.24e+04 | 2.05e+04 | 0.0 | 2.24e+03 | 4.60e+03 | 1.69e+04 | 6.57e+05 | ▇▁▁▁▁ |
| votes | 0 | 1 | 1.09e+05 | 1.58e+05 | 5.0 | 1.99e+04 | 5.57e+04 | 1.33e+05 | 1.69e+06 | ▇▁▁▁▁ |
| reviews | 0 | 1 | 5.03e+02 | 4.94e+02 | 2.0 | 1.99e+02 | 3.64e+02 | 6.31e+02 | 5.31e+03 | ▇▁▁▁▁ |
| rating | 0 | 1 | 6.39e+00 | 1.05e+00 | 1.6 | 5.80e+00 | 6.50e+00 | 7.10e+00 | 9.30e+00 | ▁▁▆▇▁ |
# show the duplicate movies
movies %>% count(title, sort=T)## # A tibble: 2,907 × 2
## title n
## <chr> <int>
## 1 Home 3
## 2 A Nightmare on Elm Street 2
## 3 Across the Universe 2
## 4 Alice in Wonderland 2
## 5 Aloha 2
## 6 Around the World in 80 Days 2
## 7 Brothers 2
## 8 Carrie 2
## 9 Chasing Liberty 2
## 10 Cinderella 2
## # … with 2,897 more rows
# to see what happens with the duplicates
movies %>% filter(title=="Homes")## # A tibble: 0 × 11
## # … with 11 variables: title <chr>, genre <chr>, director <chr>, year <dbl>,
## # duration <dbl>, gross <dbl>, budget <dbl>, cast_facebook_likes <dbl>,
## # votes <dbl>, reviews <dbl>, rating <dbl>
# `distinct` function can only keep the first entry but not latest
# movies <- distinct(movies, title, .keep_all=T)
length(unique(movies$title))## [1] 2907
movies <- movies %>%
group_by(title) %>%
filter(votes == max(votes)) %>%
ungroup()
# there are still duplicates
movies %>% count(title, sort=T)## # A tibble: 2,907 × 2
## title n
## <chr> <int>
## 1 Chasing Liberty 2
## 2 10 Cloverfield Lane 1
## 3 10 Days in a Madhouse 1
## 4 10 Things I Hate About You 1
## 5 102 Dalmatians 1
## 6 10th & Wolf 1
## 7 12 Rounds 1
## 8 12 Years a Slave 1
## 9 127 Hours 1
## 10 13 Going on 30 1
## # … with 2,897 more rows
# to see what happens with the duplicates
movies %>% filter(title=="Chasing Liberty")## # A tibble: 2 × 11
## title genre director year duration gross budget cast_facebook_l… votes
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Chasing… Comedy Andy Cad… 2004 101 1.22e7 2.3e7 842 30092
## 2 Chasing… Comedy Andy Cad… 2004 101 1.22e7 2.3e7 829 30092
## # … with 2 more variables: reviews <dbl>, rating <dbl>
# do the filter only for the entries of Chasing Liberty
movies <- movies %>%
group_by(title) %>%
filter(cast_facebook_likes==max(cast_facebook_likes)) %>%
ungroup()
skim(movies)| Name | movies |
| Number of rows | 2907 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| title | 0 | 1 | 1 | 83 | 0 | 2907 | 0 |
| genre | 0 | 1 | 5 | 11 | 0 | 17 | 0 |
| director | 0 | 1 | 3 | 32 | 0 | 1366 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2.00e+03 | 9.92e+00 | 1920.0 | 2.00e+03 | 2.00e+03 | 2.01e+03 | 2.02e+03 | ▁▁▁▂▇ |
| duration | 0 | 1 | 1.10e+02 | 2.23e+01 | 37.0 | 9.50e+01 | 1.05e+02 | 1.19e+02 | 3.30e+02 | ▃▇▁▁▁ |
| gross | 0 | 1 | 5.76e+07 | 7.23e+07 | 703.0 | 1.20e+07 | 3.45e+07 | 7.51e+07 | 7.61e+08 | ▇▁▁▁▁ |
| budget | 0 | 1 | 4.02e+07 | 4.32e+07 | 218.0 | 1.10e+07 | 2.50e+07 | 5.50e+07 | 3.00e+08 | ▇▂▁▁▁ |
| cast_facebook_likes | 0 | 1 | 1.23e+04 | 2.05e+04 | 0.0 | 2.22e+03 | 4.54e+03 | 1.68e+04 | 6.57e+05 | ▇▁▁▁▁ |
| votes | 0 | 1 | 1.09e+05 | 1.59e+05 | 5.0 | 1.95e+04 | 5.47e+04 | 1.32e+05 | 1.69e+06 | ▇▁▁▁▁ |
| reviews | 0 | 1 | 4.98e+02 | 4.93e+02 | 2.0 | 1.97e+02 | 3.58e+02 | 6.24e+02 | 5.31e+03 | ▇▁▁▁▁ |
| rating | 0 | 1 | 6.39e+00 | 1.06e+00 | 1.6 | 5.80e+00 | 6.50e+00 | 7.10e+00 | 9.30e+00 | ▁▁▆▇▁ |
The following table shows the count of movies by genre, ranked in descending order
movies %>% count(genre, sort = TRUE)## # A tibble: 17 × 2
## genre n
## <chr> <int>
## 1 Comedy 844
## 2 Action 719
## 3 Drama 484
## 4 Adventure 281
## 5 Crime 198
## 6 Biography 135
## 7 Horror 128
## 8 Animation 35
## 9 Fantasy 26
## 10 Documentary 25
## 11 Mystery 15
## 12 Sci-Fi 7
## 13 Family 3
## 14 Musical 2
## 15 Romance 2
## 16 Western 2
## 17 Thriller 1
Here we have a table with the average gross earning and budget (gross and budget) by genre. We have calculated a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. We have ranked genres by this return_on_budget in descending order-
movies %>%
mutate(movies_return = gross/budget ) %>%
group_by(genre) %>%
summarise(avg_gross = mean(gross),
avg_budget = mean(budget),
genre_return_on_budget = sum(gross)/sum(budget),
movie_mean_return_on_budget = mean(movies_return)) %>%
arrange(-movie_mean_return_on_budget)## # A tibble: 17 × 5
## genre avg_gross avg_budget genre_return_on_budget movie_mean_return_o…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Horror 37782310. 13804379. 2.74 86.1
## 2 Biography 45201805. 28543696. 1.58 22.3
## 3 Musical 92084000 3189500 28.9 18.8
## 4 Family 149160478. 14833333. 10.1 14.1
## 5 Documentary 17353973. 5887852. 2.95 8.70
## 6 Western 20821884 3465000 6.01 7.06
## 7 Fantasy 41902674. 18484615. 2.27 6.10
## 8 Animation 98433792. 61701429. 1.60 5.01
## 9 Comedy 42487808. 24458506. 1.74 3.70
## 10 Romance 31264848. 25107500 1.25 3.17
## 11 Drama 36754959. 25832605. 1.42 2.98
## 12 Mystery 69117136. 41500000 1.67 2.90
## 13 Adventure 94350236. 64692313. 1.46 2.44
## 14 Crime 37601525. 26527405. 1.42 2.19
## 15 Action 86270343. 70774558. 1.22 1.93
## 16 Sci-Fi 29788371. 27607143. 1.08 1.58
## 17 Thriller 2468 300000 0.00823 0.00823
Here we have a table that shows the top 15 directors who have created the highest gross revenue in the box office. We have shown the total gross amount, the mean, median, and standard deviation per director.
movies %>%
group_by(director) %>%
summarise(total_gross = sum(gross),
mean_gross = mean(gross),
median_gross = median(gross),
standard_dev_gross = sd(gross)) %>%
slice_max ( order_by = total_gross, n = 15)## # A tibble: 15 × 5
## director total_gross mean_gross median_gross standard_dev_gross
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Steven Spielberg 4014061704 174524422. 164435221 101421051.
## 2 Michael Bay 2195443511 182953626. 168468240. 125789167.
## 3 James Cameron 1909725910 318287652. 175562880. 309171337.
## 4 Christopher Nolan 1813227576 226653447 196667606. 187224133.
## 5 George Lucas 1741418480 348283696 380262555 146193880.
## 6 Robert Zemeckis 1619309108 124562239. 100853835 91300279.
## 7 Tim Burton 1557078534 111219895. 69791834 99304293.
## 8 Sam Raimi 1443167519 180395940. 138480208 174705230.
## 9 Clint Eastwood 1378321100 72543216. 46700000 75487408.
## 10 Francis Lawrence 1358501971 271700394. 281666058 135437020.
## 11 Ron Howard 1335988092 111332341 101587923 81933761.
## 12 Gore Verbinski 1329600995 189942999. 123207194 154473822.
## 13 Andrew Adamson 1137446920 284361730 279680930. 120895765.
## 14 Shawn Levy 1129750988 102704635. 85463309 65484773.
## 15 Ridley Scott 1128857598 80632686. 47775715 68812285.
We have produced a table that describes how ratings are distributed by genre. The histogram visually shows how ratings are distributed.
movies_rating <- movies %>%
group_by(genre) %>%
summarise(mean_rating = mean(rating),
min_rating = min(rating),
max_rating = max(rating),
sd_rating = sd(rating))
movies_rating## # A tibble: 17 × 5
## genre mean_rating min_rating max_rating sd_rating
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Action 6.23 2.1 9 1.04
## 2 Adventure 6.51 2.3 8.6 1.11
## 3 Animation 6.65 4.5 8 0.968
## 4 Biography 7.11 4.5 8.9 0.760
## 5 Comedy 6.11 1.9 8.8 1.02
## 6 Crime 6.92 4.8 9.3 0.853
## 7 Documentary 6.66 1.6 8.5 1.77
## 8 Drama 6.74 2.1 8.8 0.915
## 9 Family 6.5 5.7 7.9 1.22
## 10 Fantasy 6.08 4.3 7.9 0.953
## 11 Horror 5.79 3.6 8.5 0.987
## 12 Musical 6.75 6.3 7.2 0.636
## 13 Mystery 6.84 4.6 8.5 0.910
## 14 Romance 6.65 6.2 7.1 0.636
## 15 Sci-Fi 6.66 5 8.2 1.09
## 16 Thriller 4.8 4.8 4.8 NA
## 17 Western 5.7 4.1 7.3 2.26
movies %>%
ggplot(mapping = aes(x = rating)) +
geom_histogram(bins=30) +
facet_wrap(~genre)+
labs(title = "Distribution of ratings in each genre",
x = "Rating (1-10)",
y = "Num of movies") +
NULLggplot to find relationships between variablesUnderstanding the correlation between gross and cast_facebook_likes. We have produced a scatterplot with Facebook Likes on the X-Axis and Gross Revenue on the Y-Axis.
ggplot(movies, aes(x = cast_facebook_likes, y = gross)) +
geom_point() +
geom_smooth(method = "lm")+
labs(
title = "Relationship of Facebook Likes vs Gross Revenue of the Movie",
x = "Facebook Likes",
y = "Gross Revenue"
)+
NULL We analyze the following from the graph below-
Now we examine the relationship between gross and budget by creating a scatterplot.
ggplot(movies, aes(x = budget , y = gross)) +
geom_point() +
geom_smooth(method = "lm") +
labs(
title = "Relationship of Gross Revenue vs Budget of the Movie",
x = "Movie Budget",
y = "Gross Revenue"
)+
NULLFrom the plot above we see that, the budget and gross do seem correlated. The higher the budget, it is more likely that the gross may be higher.
Furthermore, we examine the relationship between gross and rating. Segmenting the scatterplot by ‘genre’, we can see the following results-
ggplot(movies, aes(x = rating , y = gross)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~genre) +
labs(title = "Gross vs Rating of Movies For Each Genre ",
x = "Rating",
y = "Gross") +
NULLWe can see that:
The higher the rating the more will be the gross for the most genres of movies.
For movies of some genres like ‘Documentary’, ‘Mystery’, ‘Horror’ and ‘Sci-Fi’, the gross has a very less change with respect to rating. Documentaries certainly have a different business model.
Negative correlation even appears.
Sample size of genres like ‘Family’, ‘Romance’ , ‘Musical’ is very small with under three values.
We will use the tidyquant package to download historical data of stock prices, calculate returns, and examine the distribution of returns.
We must first identify which stocks we want to download data for, and for this we must know their ticker symbol; Apple is known as AAPL, Microsoft as MSFT, McDonald’s as MCD, etc. The file nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO (Initial Public Offering) year, and the sector and industry the company is in.
nyse <- read_csv(here::here("data","nyse.csv"))
glimpse(nyse)## Rows: 508
## Columns: 6
## $ symbol <chr> "MMM", "ABB", "ABT", "ABBV", "ACN", "AAP", "AFL", "A", "…
## $ name <chr> "3M Company", "ABB Ltd", "Abbott Laboratories", "AbbVie …
## $ ipo_year <chr> "n/a", "n/a", "n/a", "2012", "2001", "n/a", "n/a", "1999…
## $ sector <chr> "Health Care", "Consumer Durables", "Health Care", "Heal…
## $ industry <chr> "Medical/Dental Instruments", "Electrical Products", "Ma…
## $ summary_quote <chr> "https://www.nasdaq.com/symbol/mmm", "https://www.nasdaq…
Based on this dataset, we have created a table and a bar plot that shows the number of companies per sector, in descending order
# a easier way
# nyse %>%
# select(sector) %>%
# table() %>%
# sort(decreasing = T) %>%
# barplot()
nyse %>%
group_by(sector) %>%
mutate(company_num = count(sector)) %>%
ggplot(aes(x=company_num, y=fct_reorder(sector, company_num))) +
geom_bar(stat="identity") +
labs(title = "Number of Companies in each Sector",
x = "",
y = "Sector")Next, we have chosen some stocks and their ticker symbols and downloaded data. The stocks we chose are: “EBR”,“BBL”,“AEE”,“BCE”,“BRO”,“CAT”,“BUD”,“SPY”
# Notice the cache=TRUE argument inthe chunk options. Because getting data is time consuming,
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd
myStocks <- c("EBR","BBL","AEE","BCE","BRO","CAT","BUD","SPY" ) %>%
tq_get(get = "stock.prices",
from = "2011-01-01",
to = "2021-08-31") %>%
group_by(symbol)
glimpse(myStocks) # examine the structure of the resulting data frame## Rows: 21,464
## Columns: 8
## Groups: symbol [8]
## $ symbol <chr> "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR"…
## $ date <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, …
## $ open <dbl> 13.9, 14.1, 14.3, 14.3, 14.0, 13.8, 13.8, 13.8, 13.9, 14.0, 1…
## $ high <dbl> 14.0, 14.3, 14.5, 14.3, 14.0, 13.9, 13.8, 13.9, 14.1, 14.3, 1…
## $ low <dbl> 13.9, 14.0, 14.3, 14.0, 13.8, 13.7, 13.6, 13.7, 13.8, 13.9, 1…
## $ close <dbl> 14.0, 14.2, 14.4, 14.1, 14.0, 13.7, 13.7, 13.8, 14.0, 14.2, 1…
## $ volume <dbl> 878300, 802900, 950300, 855800, 586000, 1020300, 1545800, 112…
## $ adjusted <dbl> 9.85, 10.00, 10.13, 9.92, 9.82, 9.63, 9.65, 9.71, 9.84, 10.03…
We can see that the dataset is 8x21,464 tibble with each row being the ohlc (open,high,low,close) and volume for a stock on a given date.
Financial performance analysis depend on returns; If I buy a stock today for 100 and I sell it tomorrow for 101.75, my one-day return, assuming no transaction costs, is 1.75%. So given the adjusted closing prices, our first step is to calculate daily and monthly returns.
#calculate daily returns
myStocks_returns_daily <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "daily",
type = "log",
col_rename = "daily_returns",
cols = c(nested.col))
#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
type = "arithmetic",
col_rename = "monthly_returns",
cols = c(nested.col))
#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
group_by(symbol) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "yearly",
type = "arithmetic",
col_rename = "yearly_returns",
cols = c(nested.col))We have created a table summarising monthly returns for each of the stocks and SPY; min, max, median, mean, SD.
glimpse(myStocks_returns_monthly)## Rows: 1,024
## Columns: 3
## Groups: symbol [8]
## $ symbol <chr> "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR", "EBR"…
## $ date <date> 2011-01-31, 2011-02-28, 2011-03-31, 2011-04-29, 2011-…
## $ monthly_returns <dbl> -0.0264, 0.0487, 0.0861, -0.0445, -0.0263, -0.0264, -0…
monthlystocks_summarised <- myStocks_returns_monthly %>%
group_by(symbol) %>%
summarise(min_return = min(monthly_returns),
max_return = max(monthly_returns),
median_return = median(monthly_returns),
mean_return = mean(monthly_returns),
sd_return = sd(monthly_returns))Plotted a density plot, using geom_density(), for each of the stocks
ggplot(myStocks_returns_monthly, aes(x = monthly_returns)) +
geom_density() +
facet_wrap(~symbol) +
labs(
title = "Density Plots for Monthly Returns of Stocks",
x = "Monthly Return",
y = "Density"
) +
NULLInferences from the plots
The monthly returns for the flatter density plots are more dispersed whereas those with tall peaks are more concentrated. The riskiest stock is EBR and the least risky is SPY (as an ETF) due to the shape of their peaks.
Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) in the X-axis. Please use ggrepel::geom_text_repel() to label each stock
monthlystocks_summarised%>%
ggplot(aes(y = mean_return, x=sd_return)) +
geom_point() +
ggrepel::geom_text_repel(aes(label = symbol)) +
NULLInferences from the plots
EBR is the most risky as it has the highest standard deviation of returns. EBR has the highest sd but also a high expected return. AEE, CAT and SPY also have high expected returns with less standard deviation. BBL and BUD have lower expected returns and higher standard deviations meaning they are riskier and do not have high expected returns. SPY produces a good return with lower risk.
We analyse a data set on Human Resource Analytics. The IBM HR Analytics Employee Attrition & Performance data set is a fictional data set created by IBM data scientists. Among other things, the data set includes employees’ income, their distance from work, their position in the company, their level of education, etc. A full description can be found on the website.
First let us load the data:
hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)## Rows: 1,470
## Columns: 35
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …
Cleaning the data:
hr_cleaned <- hr_dataset %>%
clean_names() %>%
mutate(
education = case_when(
education == 1 ~ "Below College",
education == 2 ~ "College",
education == 3 ~ "Bachelor",
education == 4 ~ "Master",
education == 5 ~ "Doctor"
),
environment_satisfaction = case_when(
environment_satisfaction == 1 ~ "Low",
environment_satisfaction == 2 ~ "Medium",
environment_satisfaction == 3 ~ "High",
environment_satisfaction == 4 ~ "Very High"
),
job_satisfaction = case_when(
job_satisfaction == 1 ~ "Low",
job_satisfaction == 2 ~ "Medium",
job_satisfaction == 3 ~ "High",
job_satisfaction == 4 ~ "Very High"
),
performance_rating = case_when(
performance_rating == 1 ~ "Low",
performance_rating == 2 ~ "Good",
performance_rating == 3 ~ "Excellent",
performance_rating == 4 ~ "Outstanding"
),
work_life_balance = case_when(
work_life_balance == 1 ~ "Bad",
work_life_balance == 2 ~ "Good",
work_life_balance == 3 ~ "Better",
work_life_balance == 4 ~ "Best"
)
) %>%
select(age, attrition, daily_rate, department,
distance_from_home, education,
gender, job_role,environment_satisfaction,
job_satisfaction, marital_status,
monthly_income, num_companies_worked, percent_salary_hike,
performance_rating, total_working_years,
work_life_balance, years_at_company,
years_since_last_promotion)glimpse(hr_cleaned)## Rows: 1,470
## Columns: 19
## $ age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35,…
## $ attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No",…
## $ daily_rate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 135…
## $ department <chr> "Sales", "Research & Development", "Researc…
## $ distance_from_home <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26…
## $ education <chr> "College", "Below College", "College", "Mas…
## $ gender <chr> "Female", "Male", "Male", "Female", "Male",…
## $ job_role <chr> "Sales Executive", "Research Scientist", "L…
## $ environment_satisfaction <chr> "Medium", "High", "Very High", "Very High",…
## $ job_satisfaction <chr> "Very High", "Medium", "High", "High", "Med…
## $ marital_status <chr> "Single", "Married", "Single", "Married", "…
## $ monthly_income <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 2…
## $ num_companies_worked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5…
## $ percent_salary_hike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13,…
## $ performance_rating <chr> "Excellent", "Outstanding", "Excellent", "E…
## $ total_working_years <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5,…
## $ work_life_balance <chr> "Bad", "Better", "Better", "Better", "Bette…
## $ years_at_company <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, …
## $ years_since_last_promotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0…
attrition)# 1233 employees stay while 237 left (19.2%).
hr_cleaned %>%
group_by(attrition)%>%
summarise(count = count(attrition))## # A tibble: 2 × 2
## attrition count
## <chr> <int>
## 1 No 1233
## 2 Yes 237
# to see the how attrition rate changes with years at company
prop.table(table(hr_cleaned[,c("years_at_company","attrition")]),1)[,2]%>%
barplot(main="Attrition Rate vs Years At Company" , xlab="Years at Company" , ylab="Attrition Rate")
abline(h=0.192,col="red") # avergae attrition rateage, years_at_company, monthly_income and years_since_last_promotion, we can see from the histograms from the summary statistics, the only variable that looks normally distributed is age.skim(hr_cleaned)| Name | hr_cleaned |
| Number of rows | 1470 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| attrition | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| department | 0 | 1 | 5 | 22 | 0 | 3 | 0 |
| education | 0 | 1 | 6 | 13 | 0 | 5 | 0 |
| gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| job_role | 0 | 1 | 7 | 25 | 0 | 9 | 0 |
| environment_satisfaction | 0 | 1 | 3 | 9 | 0 | 4 | 0 |
| job_satisfaction | 0 | 1 | 3 | 9 | 0 | 4 | 0 |
| marital_status | 0 | 1 | 6 | 8 | 0 | 3 | 0 |
| performance_rating | 0 | 1 | 9 | 11 | 0 | 2 | 0 |
| work_life_balance | 0 | 1 | 3 | 6 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 36.92 | 9.14 | 18 | 30 | 36 | 43 | 60 | ▂▇▇▃▂ |
| daily_rate | 0 | 1 | 802.49 | 403.51 | 102 | 465 | 802 | 1157 | 1499 | ▇▇▇▇▇ |
| distance_from_home | 0 | 1 | 9.19 | 8.11 | 1 | 2 | 7 | 14 | 29 | ▇▅▂▂▂ |
| monthly_income | 0 | 1 | 6502.93 | 4707.96 | 1009 | 2911 | 4919 | 8379 | 19999 | ▇▅▂▁▂ |
| num_companies_worked | 0 | 1 | 2.69 | 2.50 | 0 | 1 | 2 | 4 | 9 | ▇▃▂▂▁ |
| percent_salary_hike | 0 | 1 | 15.21 | 3.66 | 11 | 12 | 14 | 18 | 25 | ▇▅▃▂▁ |
| total_working_years | 0 | 1 | 11.28 | 7.78 | 0 | 6 | 10 | 15 | 40 | ▇▇▂▁▁ |
| years_at_company | 0 | 1 | 7.01 | 6.13 | 0 | 3 | 5 | 9 | 40 | ▇▂▁▁▁ |
| years_since_last_promotion | 0 | 1 | 2.19 | 3.22 | 0 | 0 | 1 | 3 | 15 | ▇▁▁▁▁ |
job_satisfaction and work_life_balancehr_cleaned %>%
group_by(job_satisfaction)%>%
summarise(countjs = n(),
percentagejs = countjs/nrow(hr_cleaned)*100)## # A tibble: 4 × 3
## job_satisfaction countjs percentagejs
## <chr> <int> <dbl>
## 1 High 442 30.1
## 2 Low 289 19.7
## 3 Medium 280 19.0
## 4 Very High 459 31.2
hr_cleaned %>%
group_by(work_life_balance)%>%
summarise(countwlb= n(),
percentagewlb = round(countwlb/nrow(hr_cleaned)*100,2))## # A tibble: 4 × 3
## work_life_balance countwlb percentagewlb
## <chr> <int> <dbl>
## 1 Bad 80 5.44
## 2 Best 153 10.4
## 3 Better 893 60.8
## 4 Good 344 23.4
ggplot(hr_cleaned, aes(x = fct_relevel(education,
"Doctor", "Master", "Bachelor",
"College", "Below College"), y = monthly_income)) +
geom_boxplot()+
labs(title = "Boxplot of Monthly Income against Education",
x = "Education",
y = "Monthly Income")+
NULLggplot(hr_cleaned, aes(x = monthly_income, y = gender)) +
geom_boxplot()+
labs(title = "Boxplot of Monthly Income against Gender",
x = "Monthly Income",
y = "Gender")+
NULLggplot(hr_cleaned, aes(x=fct_reorder(job_role,-monthly_income), y=monthly_income)) +
geom_boxplot() +
labs(title = "Boxplot of Monthly Income against Job Role",
x = "Job Role",
y = "Monthly Income")+
NULLhr_cleaned %>%
group_by(education) %>%
summarise(medianinc = median(monthly_income),
meaninc = mean(monthly_income)) %>%
ggplot(aes(x = fct_relevel(education,
"Doctor", "Master", "Bachelor",
"College", "Below College"),
y=meaninc)) +
geom_bar(stat = "identity") +
labs(title = "Average Monthly Income of Each Education Level",
x = "Education",
y = "Average Monthly Income")+
NULLhr_cleaned %>%
ggplot(aes(x=monthly_income)) +
geom_histogram(bins=30)+
facet_wrap(~fct_relevel(education,
"Doctor", "Master", "Bachelor",
"College", "Below College")) +
theme_wsj() +
NULLjob_rolehr_cleaned %>%
ggplot(aes(y=monthly_income, x=age)) +
geom_point() +
geom_smooth(method="lm")+
facet_wrap(~job_role) +
theme_wsj() +
NULLThe purpose of this exercise is to reproduce a plot using your dplyr and ggplot2 skills. Read the article The Racial Factor: There’s 77 Counties Which Are Deep Blue But Also Low-Vaxx. Guess What They Have In Common? and have a look at the attached figure.
# 3154 unique fips code in election2020_results
length(unique(election2020_results$fips))## [1] 3154
# check unique values of candidate names
unique(election2020_results$candidate)## [1] "JOSEPH R BIDEN JR" "OTHER" "DONALD J TRUMP"
## [4] "JO JORGENSEN"
data <- election2020_results %>%
mutate(votes_percentage = candidatevotes/totalvotes) %>% # calculate percentage
filter(candidate=="DONALD J TRUMP", mode=="TOTAL") %>% # we only need Trump votes
inner_join(vaccinations,by = "fips") %>% # inner join with vaccinations data
inner_join(population,by = "fips") # inner join with population data
# generate graph below
# install.packages("ggpubr") # to show equation easily
library(ggpubr)
# calculate actual vaccination percentage
actual = sum(data$series_complete_yes)/sum(data$pop_estimate_2019)
data %>%
# filter out 0% vaccinated points
filter(series_complete_pop_pct>0) %>%
ggplot() +
geom_point(aes(x=votes_percentage,y=series_complete_pop_pct/100,size=pop_estimate_2019),shape=21,fill="light blue")+
# scale circle size
scale_size(range = c(0, 20)) +
# add points in the center of circles
geom_point(aes(x=votes_percentage,y=series_complete_pop_pct/100),size=0.5)+
# add regression line
geom_smooth(aes(x=votes_percentage,y=series_complete_pop_pct/100),
method="lm",linetype="dotted",colour="blue",se=FALSE)+
# add the equation of regression line
stat_regline_equation(aes(x=votes_percentage,y=series_complete_pop_pct/100),
label.y = 0.1,colour="red",fontface = "bold.italic") +
# add r square
stat_cor(aes(x=votes_percentage,y=series_complete_pop_pct/100,label = paste(..rr.label..)),
label.y = 0.05,colour="red")+
# add date
geom_text(aes(x=0.3, y=0.07,label = "09/03/2021",
fontface = "bold.italic"),colour="red")+
# add horizonal lines below
geom_hline(aes(yintercept=0.85), linetype=2) + # herd immunity line
geom_text(aes(x=0, y=0.85, label = "Herd Immunity threshold (?)",
vjust=-1, hjust=0, fontface = "bold.italic"),colour="blue") +
geom_hline(aes(yintercept=0.539), linetype=2) + # Target line
geom_text(aes(x=0, y=0.539,label = "TARGET: 53.9%",
vjust=-1,hjust=0, fontface = "bold.italic"),colour="blue") +
geom_hline(aes(yintercept=actual), linetype=2) + # actual line
geom_text(aes(x=0, y=actual, label = paste("ACTUAL: ", round(actual*100,1),"%"),
vjust=-1,hjust=0,fontface = "bold.italic"),
colour="blue") +
# adjuct grid lines
scale_x_continuous(label=scales::percent_format(accuracy = 1),
breaks=seq(0,1,0.05),limits = c(0,1))+
scale_y_continuous(label=scales::percent_format(accuracy = 1),
breaks=seq(0,1,0.05),limits = c(0,1))+
geom_text(aes(x=0.5,y=1, label = "EACH U.S. COUNTY",vjust=0.5,hjust=0.5,
family="mono",fontface = "bold"),
size=5)+
labs(title = "COVID-19 VACCINATION LEVELS OUT OF TOTAL POPULATION BY COUNTY",
x = "2020 Trump Vote %",
y = "% of Total Population Vaccinated")+
# remove the legend
theme(legend.position = "none")+
NULLThe Guardian newspaper has an election poll tracker for the upcoming German election. The list of the opinion polls since Jan 2021 can be found at Wikipedia and your task is to reproduce the graph similar to the one produced by the Guardian.
The following code will scrape the wikipedia page and import the table in a dataframe.
url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor
# get tables that exist on wikipedia page
tables <- url %>%
read_html() %>%
html_nodes(css="table")
# parse HTML tables into a dataframe called polls
# Use purr::map() to create a list of all tables in URL
polls <- map(tables, . %>%
html_table(fill=TRUE)%>%
janitor::clean_names())
# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinions polls
slice(2:(n()-1)) %>% # drop the first row, as it contains again the variable names and last row that contains 2017 results
mutate(
# polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
# and we extract it by picking the last 11 characters from that field
end_date = str_sub(fieldwork_date, -11),
# end_date is still a string, so we convert it into a date object using lubridate::dmy()
end_date = dmy(end_date),
# we also get the month and week number from the date, if we want to do analysis by month- week, etc.
month = month(end_date),
week = isoweek(end_date)
)
german_election_polls %>%
ggplot() +
# union
geom_point(aes(x=end_date,y=union, colour="UNION"),shape=21)+
# span=0.2 to make the line less smoothed
geom_smooth(aes(x=end_date,y=union, colour="UNION"),se=F,span = 0.2)+
# spd
geom_point(aes(x=end_date,y=spd, colour="SPD"),shape=21)+
geom_smooth(aes(x=end_date,y=spd,colour="SPD"),se=F,span = 0.2)+
# afd
geom_point(aes(x=end_date,y=af_d,colour="AFD"),shape=21)+
geom_smooth(aes(x=end_date,y=af_d,colour="AFD"),se=F,span = 0.2)+
#fdp
geom_point(aes(x=end_date,y=fdp,colour="FDP"),shape=21)+
geom_smooth(aes(x=end_date,y=fdp,colour="FDP"),se=F,span = 0.2)+
#grune
geom_point(aes(x=end_date,y=grune, colour="GRUNE"),shape=21)+
geom_smooth(aes(x=end_date,y=grune,colour="GRUNE"),se=F,span = 0.2)+
#linke
geom_point(aes(x=end_date,y=linke,colour="LINKE"),shape=21)+
geom_smooth(aes(x=end_date,y=linke,colour="LINKE"),se=F,span = 0.2)+
#display every month
scale_x_date(date_labels="%b %y",date_breaks ="1 month")+
labs(
x="Date",
y="Votes %"
)+
scale_colour_manual("",
breaks = c("UNION","SPD","AFD","FDP","GRUNE","LINKE"),
values = c("black","red","blue","yellow","dark green","purple"))+
NULLTeam Members: Alex Kubbinga, Clara Moreno Sanchezu, Jean Huang, Raghav Mehta, Raina Doshi, Yuan Gao
Time Spent: 8 hours
Additional Information: We have gone through online documentation available on google and stackoverflow.
Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.
Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).
Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.